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Abstract 

We introduce a variant of (sparse) PCA in which 
the set of feasible support sets is determined by 
a graph. In particular, we consider the following 
setting; given a directed acyclic graph G on p 
vertices corresponding to variables, the non-zero 
entries of the extracted principal component must 
coincide with vertices lying along a path in G. 

From a statistical perspective, information on the 
underlying network may potentially reduce the 
number of observations required to recover the 
population principal component. We consider 
the canonical estimator which optimally exploits 
the prior knowledge by solving a non-convex 
quadratic maximization on the empirical covari¬ 
ance. We introduce a simple network and analyze 
the estimator under the spiked covariance model. 

We show that side information potentially im¬ 
proves the statistical complexity. 

We propose two algorithms to approximate the 
solution of the constrained quadratic maximiza¬ 
tion, and recover a component with the desired 
properties. We empirically evaluate our schemes 
on synthetic and real datasets. 

1. Introduction 

Principal Component Analysis (PCA) is an invaluable tool 
in data analysis and machine learning. Given a set of 
n centered p-dimensional datapoints Y S the first 

principal component is 

argmaxx^Sx, (1) 

l|x||2 = l 
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where S = i/n • YY^ is the empirical covariance matrix. 
The principal component spans the direction of maximum 
data variability. This direction usually involves all p vari¬ 
ables of the ambient space, in other words the PC vectors 
are typically non-sparse. However, it is often desirable to 
obtain a principal component with specific structure, for 
example limiting the support of non-zero entries. From 
a statistical viewpoint, in the high dimensional regime 
n = 0{p), the recovery of the true (population) principal 
component is only possible if additional structure infor¬ 
mation, like sparsity, is available for the former (|Amini &| 
|Wainwright||200^|Vu & Lei|[20T2] l. 

There are several approaches for extracting a sparse princi¬ 
pal component. Many rely on approximating the solution to 

maxx^Sx, subject to |jx ||2 = 1, ||x|jo < A:. (2) 

The non-convex quadratic optimization is NP hard (by a 
reduction from maximum clique problem), but optimally 
exploits the side information on the sparsity. 

Graph Path PCA. In this paper we enforce additional 
structure on the support of principal components. Consider 
a directed acyclic graph (DAG) G = {V,E) on p vertices. 
Let S and T be two additional special vertices and consider 
all simple paths from S' to T on the graph G. Ignoring the 
order of vertices along a path, let V{G) denote the collec¬ 
tion of all S-T paths in G. We seek the principal compo¬ 
nent supported on a path of G, i.e., the solution to 

max x^ Sx, (3) 

xeA(G) 

where 

X{G) = {xeM^ : ||x ||2 = 1, supp(x) S T’(G)}. (4) 

We will argue that this formulation can be used to impose 
several types of structure on the support of principal com¬ 
ponents. Note that the covariance matrix S and the graph 
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can be arbitrary: the matrix is capturing data correlations 
while the graph is a mathematical tool to efficiently de¬ 
scribe the possible supports of interest. We illustrate this 
through a few applications. 

Financial model selection: Consider the problem of iden¬ 
tifying which companies out of the S&P500 index capture 
most data variability. Running Sparse PCA with a spar¬ 
sity parameter k will select k companies that maximize 
explained variance. However, it may be useful to enforce 
more structure: If we must select one company from each 
business sector {e.g., Energy, Health Care,efc.) how could 
we identify these representative variables? 

In Section]^ we show that this additional requirement can 
be encoded using our graph path framework. We compare 
our variable selection with Sparse PCA and show that it 
leads to interpretable results. 

Biological and fMRI networks: Several problems involve 
variables that are naturally connected in a network. In 
these cases our Graph Path PCA can enforce interpretable 
sparsity structure, especially when the starting and ending 
points are manually selected by domain experts. In sec¬ 
tion]^ we apply our algorithm on fMRI data using a graph 
on regions of interest (ROIs) based on the Harvard-Oxford 
brain structural atlas ( |Desikan et al.[[200^ . 

We emphasize that our applications to brain data is prelimi¬ 
nary: the directional graphs we extract are simply based on 
distance and should not be interpreted as causality, simply 
as a way of encoding desired supports. 

What can we say about the tractability of ([^ ? We note that 
despite the additional constraints on the sparsity patterns, 
the number of admissible support sets (i.e. S-T paths) can 
be exponential in p, the number of variables. For example, 
consider a graph G as follows: S connected to two nodes 
who are then both connected to two nodes, etc. for k levels 
and finally connected to T. Clearly there are 2^ S-T paths 
and therefore a direct search is not tractable. 

Our Contributions; 


From a statistical viewpoint, we show that side in¬ 
formation on the underlying graph G can reduce the 
number of observations required to recover the popu¬ 
lation principal component x* G A' (G) via For 
our analysis, we introduce a simple, sparsity-inducing 
network model on p vertices partitioned into k lay¬ 
ers, with edges from one layer to the next, and max¬ 
imum out-degree d (Fig. [^. We show that n = 
0{logp/k -f fclogd) observations ~ A^(0, S), suf¬ 
fice to obtain an arbitrarily good estimate via Our 
proof follows the steps of (Vu & Lei 2012[). 


2. We complement this with an information-theoretic 
lower bound on the minimax estimation error, un- 








I 




Figure 1. A (p, k, d)-layer graph G = (V, E): a DAG on p vertices, 
partitioned into k disjoint sets (layers) Ci,..., The highlighted ver¬ 
tices form an S-T path. 


der the spiked covariance model with latent signal 
x,t G X{G), which matches the upper bound. 

3. We propose two algorithms for approximating the so- 


lution of based on those of (|Yuan & Zhang] 

2013 

and (Papailiopoulos et al.| 2013 Asteris et al.| 

2014 


for the sparse PCA problem. We empirically evaluate 
our algorithms on synthetic and real datasets. 


Related Work There is a large volume of work on al¬ 


gorithms and the statistical analysis of sparse PCA (John 


stone & Lu 

|2004t Zou et al. 

20061 d’Aspremont et al.l 

200812007 

Johnstone & Lu 2004 |Vu & Lei 20121 Amini| 


& Wainwright| 2009| l. On the contrary, there is limited 
work that considers additional structure on the sparsity pat¬ 
terns. Motivated by a face recognition application, Penat-| 
|ton et al.] |2010[ ) introduce structured sparse PCA using a 
regularization that encodes higher-order information about 
the data. The authors design sparsity inducing norms that 
further promote a pre-specified set of sparsity patterns. 

Finally, we note that the idea of pursuing additional struc¬ 
ture on top of sparsity is not limited to PCA: Model- 
based compressive sensing seeks sparse solutions under 


a restricted family of sparsity patterns ( Baldassarre et al. 
2013 Baraniuk et al.| 2010[ Kyrillidis & Cevher 2012| l^ 


while structure induced by an underlying network is found 
in (|Mairal & Yu||20lT]) for sparse linear regression. 


2. A Data Model - Sample Complexity 

The layer graph. Consider a directed acyclic graph 

G = (V, E) on p vertices, with the following properties: 

• V = {S,T} U V, where S' is a source vertex, T is a ter¬ 
minal one, and V is the set of remaining p — 2 vertices. 

• V can be partitioned into k disjoint subsets {layers) 
£i,... ,£fe, i.e., [jiCi = V, and A n Cj = 0, Vi, j G 
[k\,i 7 ^ j, such that: 

rout(r:) CZ Gj-i-i, Vu G Ci, for % — 1,..., A: 1, 
where rout(u) denotes the out-neighborhood of v. 
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- ro,.(5) = £i, and = {T}, \fv S 

In the sequel, for simplicity, we will further assume 
that p — 2 is a multiple of k and \Ci\= (p - 2)/fe, Vi G [k]. 
Further, |rout('y)| = d, Vt; G £i, i = 1,... ,k — 1, and 
|rin(t^)| = d, Vt; G i = 2,..., fc, where rin(u) denotes 
the in-neighborhood of v. In words, the edges from one 
layer are maximally spread accross the vertices of the next. 
We refer to G as a (p, k, d)-layer graph. 

Fig. [^illustrates a (p, k, (i)-layer graph G. The highlighted 
vertices form an S-T path tt: a set of vertices forming a trail 
from S to T. Let V{G) denote the collection of S-T paths 
in a graph G for a given pair of source and terminal vertices. 
For the (p, k, d)-layer graph, |7r| = /c, Vtt G 'P{G), and 

since d G {1,..., (p-2)/fe}. 


Spike along a path. We consider the spiked covariance 


model, as in the sparse PCA literature (Johnstone & Lu 


|2004t |Amini & Wainwright[ |2008[ ). Besides sparsity, we 
impose additional structure on the latent signal; structure 
induced by a (known) underlying graph G. 


Consider a p-dimensional signal x* and a bijective map¬ 
ping between the p variables in x* and the vertices of G. 
For simplicity, assume that the vertices of G are labeled so 
that Xi is associated with vertex i G L. We restrict x* in 


X{G) = {xgM^ : ||x ||2 = 1, supp(x) G T’(G)}, 


that is, X* is a unit-norm vector whose active (nonzero) 
entries correspond to vertices along a path in V{G). 

We observe n points (samples) G generated 

randomly and independently as follows; 

Yi = •x*-f Zi, (5) 

where the scaling coefficient Ui ~ 1) and the additive 

noise z^ ~ Ip) are independent. Equivalently, y^s are 
i.i.d. samples, distributed according to NiO, S), where 

S = Ip -f/3 • x*xj. (6) 


2.1. Lower bound 


for some /3 > 0. Let 22)^"'(x*) denote the product measure 
over the n independent draws. Consider the problem of es¬ 
timating 'x.i^from the n observations, given G. There exists 
X* G X(G) such that for every estimator x, 

Ei3J">(x.) [px^ - X*x* ||p] > 

^ ■ yminjl, ^ (log + I logd)}. ( 7 ) 


Theorem effectively states that for some latent sig¬ 
nal K^, G X{G), and observations generated according to 
the spiked covariance model, the minimax error is bounded 
away from zero, unless n = Lt (logp/fc -f fclogd). In the 
sequel, we provide a sketch proof of Theorem [1] following 
the steps of (Vu & Lei 2012[ ). 


The key idea is to discretize the space X{G) in order to 
utilize the Generalized Fano Inequality ( |Yu| |1997[ ). The 
next lemma summarizes Fano’s Inequality for the special 
case in which the n observations are distibuted according 
to the n-fold product measure I?)_"’(x*): 

Lemma 2.1 (Generalized Fano (|Yul |19971l). Let 
Xe C X{G) be a finite set of points Xi,... ,^\x^\ G X{G), 
each yielding a probability measure on the n 

observations. If d{xi,Xj) > a, for some pseudo-metri^ 
d{-,-) and the Kullback-Leibler divergences satisfy 


KL(l?<'*)(x,) II < 7, 


for all i j, then for any estimator x 


max E„(„) 


(x.) 


[d(x,x,)] 


> 


a 

2 


/ 7 -flog 2 \ 

I logiA-.!; 


• ( 8 ) 


Inequality ([^, using the pseudo-metric 

d (x, x) = ||xx^ — xx^ ||f, 

will yield the desired lower bound of Theorem on the 
minimax estimation error (Eq. (0). To that end, we need to 
show the existence of a sufficiently large set X (G) 

such that (i) the points in 44 are well separated under d{-, •), 
while (ii) the KL divergence of the induced probability 
measures is upper appropriately bounded. 

Lemma 2.2. (Local Packing) Consider a {p, k, d)-layer 
graph G on p vertices with fc > 4 and log d > A ■ H(f/f). 
For any e G (0,1], there exists a set 44 C 4:’(G) such that 

e/V2 < ||xi-Xj||2 < 


Theorem 1 (Lower Bound). Consider a {p, k, d)-layer 
graph G on p vertices, with k> A, and logd > AH{^/f). 
{Note that p — 2 > k ■ d), and a signal x* G 41(G). Let 
{yi}r=i sequence of n random observations, inde¬ 
pendently drawn according to probability density function 

I2p(x*) =A/'(0,Ip-|-^-x*xJ), 


for all Xi,Xj G 44, x^ f Xj, and 

log \X^\ > log -f 1/4 • ^ • log d. 

^ A pseudometric on a set ^ is a function d : R that satisfies 

all properties of a distance (non-negativity, symmetry, triangle inequality) 
except the identity of indiscemibles: d(q. q) = 0, Vq E Q but possibly 
d{cii , q 2 ) = 0 for some qi 7 ^ q 2 G Q- 
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Proof. (See Appendix[7]i. 


□ 


For a set with the properties of Lemma taking into 
account the fact that Ijx^x^ — x_,x^||f > ||xi — Xj ||2 
(Lemma A. 1.2 of ( |Vu & LeT||2012[ )), we have 


d (x„Xj) = ||x,x, - XjX^- Up > — = a . 


(9) 


VxijXj G Xg,Xi Xj. Moreover, 

KL(I?p(x,) II Op(x,)) = ^ . [(l + /l)x 


Tr(^(l - XjxJ)xixj ^ -Tr(^Xjx7(I - xtxj )^] 


_ X 

^ 4(1+/3) 


II T T||2 , II ||2 

Ilx.x, -X,X^ Up < 


In turn, for the n-fold product distribution, and taking into 
account that ||xi — Xj II 2 < v/2 ■ e, 


KL(l?W(x,) < 


2n/3^e^ 

(1 + /?) 


A 


= 7- 


( 10 ) 


Eq.@ and ([T 0 | l establish the parameters a and 7 required 
by Lemma |2.1| Substit uting those into ([^, along with the 
lower bound of Lemma 2.2 on |<141, we obtain 


max Ei,7,(,,,)[d(x,Xi)] > ^ 


<*) 2 <52 

^ 04 !)+logs 

log II 


. ( 11 ) 


The final step towards establishing the desired lower bound 
in Q is to appropriately choose e. One can verify that if 

e 2 = min{l, i (log ^ + 1- logd)} , ( 12 ) 

where C" > 0 is a constant to be determined, then 


26^/32 1 ^ 1 

(1 + /3) log|<T,| - 4 


and 


log|<T,| >41og2, (13) 


(see Appendix|^for details). Under the conditions in ( [T3] l, 
the inequality in ([TT|l implies that 


max Ei, 7 ,(x.)[d(x,x,)] > ^ • e. (14) 


At > A 2 > ..., and principal eigenvector x,^. Let S be 
the empirical covariance of the n samples, x the estimate 
ofXi, obtained via (|^, and e = ||xx^ — x^,xJ ||f. Then, 

Efe] < C ■ -• — • max-f VnA, , 

Ai — A 2 n L J 

where A = 0(log ^ 7 ^ + k logd). 

In the sequel, we provide a sketch proof of Theorem!^ The 
proof closely follows the steps of ( |Vu & Lei] |2012| l in de¬ 
veloping their upper bound for the the sparse PCA problem. 

Lemma 2.3 (Lemma 3.2.1 ( |Vu & L^ |2012] i). Consider 
S G with principal eigenvector x* and Agap=Ai — 

A 2 (S). For any X G with ||x ||2 = 1, 

-x*xj ||2 < (S, x*x 7 -XX^). 


Let X be an estimate of x* via Q, and e=||xx^ — x*xj Up. 
From Lemmait follows (see ( |Vu & LeTl|2012| l) that 

ApAa •e2<(S-S, Xx^-x*x7). (15) 


Both X* and x belong to X(G): unit-norm vectors, with 
support of cardinality k + 2 coinciding with a path in 7^(G). 
Their difference is supported in V^{G): the collection of 
sets formed by the union of two sets in V{G). Let X^{G) 
denote the set of unit norm vectors supported in V^{G). 
Via an appropriate upper bounding of the right-hand side 
of ( [T5| |, ( |Vu & LeTl|2012| l show that 




■E 


sup, 




| 6 >^(S - S) 6 »| 


for an appropriate constant G > 0. Further, under the as¬ 
sumptions on the data distribution, and utilizing a result due 
to ( [Mendels^ | 2010 | l, 

< G'K^ — maxj A^ } , 

for G' and K constants depending on the distribution, and 

A = Ey~jv(o,Ip) [supgg_;^,2 (Y, 0)] . (16) 


E 


sup | 6 /^(S-S) 6 >| 


Substituting e according to ([T^, yields the desired result 
in (|7]), completing the proof of Theorem[T] 

2.2. Upper bound 

Our upper bound is based on the estimator obtained via the 
constrained quadratic maximizaton in We note that the 
analysis is not restricted to the spiked covariance model; it 
applies to a broader class of distributions (see Assum.[T]l. 

Theorem 2 (Upper bound). Consider a {p, k, d)-layer 
graph G and x* G <T(G). Let be a sequence of n 

i.i.d. Af { 0 , S) samples, where S ^ 0 with eigenvalues 


This reduces the problem of bounding E[e] to bounding the 
supremum of a Gaussian process. Let Ns C <T2(G) be 
a minimal 5-covering of X'^iG) in the Euclidean metric 
with the property that Vx G X^{G), 3y G Ns such that 
||x — y ||2 < 5 and supp(x — y) S V'^{G). Then, 

supgg^2(Y,0) < (l-5)-i-m^x(Y,0). (17) 

Taking expectation w.r.t. Y and applying a union bound on 
the right hand side, we conclude 

A<G-(l-5)-^ Vlog|A 4 |. 


(18) 
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It remains to construct a J-covering Ms with the desired 
properties. To this end, we associate isometric copies of 
82 ^^^ with each support set in It is known that 

there exists a minimal 5-covering for with cardinal¬ 
ity at most (1 + The union of the local 5-nets 

forms a set Ms with the desired properties. Then, 

yog\Ms\ < log|iP2(G)|+2(fc + l)log(l + 2/^) 

= 0(log^ +klogd) , 

for any constant 5. Substituting in ( [TSl l, implies the desired 
bound on E[e], completing the proof of Theorem]^ 


3. Algorithmic approaches 


We propose two algorithms for approximating the solution 
of the constrained quadratic maximization in (|^; 


1. The first is an adaptation of the truncated power itera¬ 
tion method of ( |Yuan & Zhang[ |2013| l for the problem 
of computing sparse eigenvectors. 

2. The second relies on approximately s olving jS- on 


low rank approximation of S, similar to (Papailiopoulos 
|et al.|[^13[|Asteris et al.||2014| l. 


Both algorithms rely on a projection operation from 
onto the feasible set X[G), for a given graph G = (V, E). 
Besides the projection step, the algorithms are oblivious to 
the specifics of the constraint setj^and can adapt to differ¬ 
ent constraints by modifying the projection operation. 


3.1. Graph-Truncated Power Method 


Algorithm 1 Graph-Truncated Power Method 

input E G G = {V, E), xo G 

1: i 0 
2: repeat 
3: Wi t— Sxi 

4: Xi+i ^ Proj;(.(ej(wi) 

5: i t— i -I- 1 

6: until Convergence/Stop Criterion 

output Xi 


We consider a simple iterative procedure, similar to the 
truncated power method of ( Yuan & Zhang| 2013| l for the 
problem of computing sparse eigenvectors. Our algorithm 
produces sequence of vectors x^ G X(G), i > 0, that serve 
as intermediate estimates of the desired solution of (|^. 


The procedure is summarized in Algorithm In the ith it¬ 
eration, the current estimate x^ is multiplied by the empir¬ 
ical covariance S, The product G is projected back 


^For Alg. the observation holds under mild assumptions: X{G) 
must be such that ||x ||2 = 0(1), while ±x G A(G) should both achieve 
the same objective value. 


to the feasible set X{G), yielding the next estimate x^+i. 
The core of Algorithm [T] lies in the projection operation, 

ProjA'(G)(w) = argmin^ljx-w||^, (19) 

xeA(G) ^ 

which is analyzed separately in Section |3.3| The initial 
estimate Xq can be selected randomly or based on simple 
heuristics, e.g., the projection on X(G) of the column of S 
corresponding to the largest diagonal entry. The algorithm 
terminates when some convergence criterion is satisfied. 

The computational complexity (per iteration) of Algo¬ 
rithm is dominated by the cost of matrix-vector multi¬ 
plication and the projection step. The former is 0{k ■ p), 
where k is cardinality of the largest support in X{G). 
The projection operation for the particular set X (G), boils 
down to solving the longest path problem on a weighted 
variant of the DAG G (see Section |3.3| l, which can be 
solved in time 0(|y| -f \E\), i.e., linear in the size of G. 

3.2. Low-Dimensional Sample and Project 

The second algorithm outputs an estimate of the desired 
solution of ([^ by (approximately) solving the constrained 
quadratic maximization not on the original matrix S, but 
on a low rank approximation of S, instead: 

r r 

= E ( 20 ) 

2—1 2=1 

where Xi is the ith largest eigenvalue of S, is the cor¬ 
responding eigenvector, Vi=y/Ai • q^, and V is the p x r 
matrix whose ith column is equal to v^. The approximation 
rank r is an accuracy parameter; typically, r p. 

Our algorithm operate^on and seeks 

x^ = argmax x^ S^x. ( 21 ) 

xGAjG) 

The motivation is that an (approximate) solution for the 
low-rank problem in ( |2T] ) can be efficiently computed. In¬ 
tuitively, if Sf is a sufficiently good approximation of the 
original matrix S, then x^ would perform similarly to the 
solution X* of the original problem Q. 

The Algorithm. Our algorithm samples points from 
the low-dimensional principal subspace of S, and projects 
them on the feasible set X{G), producing a set of candi¬ 
date estimates for x^. It outputs the candidate that maxi¬ 
mizes the objective in (| 2 T|. The exact steps are formally 
presented in Algorithm]^ The following paragraphs delve 
into the details of Algorithm]^ 

^ Under the spiked covariance model, this approach may be asymptoti¬ 
cally unsuitable; as the ambient dimension increases, it with fail to recover 
the latent signal. Empirically, however, if the spectral decay of S is sharp, 
it yields very competitive results. 
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Algorithm 2 Low-Dimensional Sample and Project 

input S G G = (L, E), r G [p]. 

£ > 0 

1 

[Q, A] t— svd(S, r) 


2 

V £- qaV= 

{Sr=VV^} 

3 

C ^ 0 

{Candidate solutions} 

4 

for i = 1 : 0{e~'" ■ logp) do 


5 

Ci G- uniformly sampled from 

-1 

6 

Wi G- Vci 


7 

Xi ^ Proj;k.(Q)(wi) 


8 

C = CU{xi} 


9 

end for 


output Xr -1— argmax^^gf, ||V^x|j2 



3.2.1. The Low Rank Problem 

The rank-r maximization in ( |2T| can be written as 

max = max IIV^xllo, (22) 

xeA(G) xeA’(G) 

and in turn (see ( |Asteris et al.||2014l l for details), as a dou¬ 
ble maximization over the variables c S S''“^ and x G 

max IjV^xlln = max max ((Vc)^x) . (23) 

xGA’(G) cGS—i xGAr(G) ' ' 


such a net by random sampling. By definition, A4 contains 
at least one point, call it c^, in the vicinity of c^. It can be 
shown that the corresponding solution x(cr) in ( |24l i will 
perform approximately as well as the optimal solution x^, 
in terms of the quadratic objective in ( [2^ , for a large, but 
tractable, number of points in the e-net of 

3.3. The Projection Operator 

Algorithms [2 and rely on a projection operation from 
MF onto the feasible set X{G) (Eq. Q). We show that the 
projection effectively reduces to solving the longest path 
problem on (a weighted variant of) G. 

The projection operation, defined in Eq. ( [T^ , can be equiv- 
alentl}]^ written as 

Pj'ojA'fG)~ argmaxw^x. 

xeA(G) 

Eor any x G A(G), supp(x) G 'P{G). Eor a given set tt, by 
the Cauchy-Schwarz inequality, 

w,Xi < wf = w^l^, (27) 


The rank-1 case. Let w=Vc; w is only a vector in 
Eor given c and w, the x that maximizes the objective in 
© (as a function of c) is 

x(c) G argmax (w^x)^ . (24) 

xGA'(G) 

The maximization in ( |24l l is nothing but a rank-1 instance 
of the maximization in ( |22l l. Observe that if x G <T(G), 
then —X G X{G), and the two vectors attain the same ob¬ 
jective value. Hence, ( |24l l can be simplified; 

x(c) G argmaxw^x. (25) 

xSAJG) 

Eurther, since ||x||2 = 1, Vx G A’(G), the maximization 
in ( [25| l is equivalent to minimizing |||w — xjH. In other 
words, x(c) is just the projection of w G onto X{G): 

x(c) G Proj;i,((3)(w). (26) 

The projection operator is described in Section [331 

Multiple rank-1 instances. Let (c,., x^) denote a pair that 
attains the maximum value in ( |2^ . If was known, then 
Xr would coincide with the projection x(cj.) of w = Vc^ 
on the feasible set, according to ( |2^ . 

Of course, the optimal value of the auxiliary variable 
is not known. Recall, however, that lies on the low di¬ 
mensional manifold Consider an e-net covering 

the r-dimensional unit sphere Algorithm[^constructs 


where w G is the vector obtained by squaring the en¬ 
tries of w, l.e., Wi = wf, Vi G [n], and l^r G {0,1}^ de¬ 
notes the characteristic of tt. Letting x[ 7 r] denote the sub¬ 
vector of X supported on tt, equality in ( |27l ) can be achieved 
by X such that x[ 7 r] = w[ 7 r]/||w[ 7 r]|| 2 , and x[ 7 r'^] = 0. 

Hence, the problem in ( |27] l reduces to determining 

7r(w) G argmaxw^l^. (28) 

neviG) 


Consider a weighted graph Gw, obtained from G = {V,E) 
by assigning weight Wy = on vertex v G V. The objec¬ 
tive function in ( |28] l equals the weight of the path tt in Gw, 
l.e., the sum of weights of the vertices along tt. Determin¬ 
ing the optimal support 7r(w) for a given w, is equivalent 
to solving the longest (weighted)path probler^^on Gw. 


The longest (weighted) path problem is NP-hard on arbi¬ 
trary graphs. In the case of DAGs, however, it can be solved 
using standard algorithms relying on topological sorting in 
time 0(|L| + |i7|) ( jCormen et al. 20011, i.e., linear in the 
size of the graph. Hence, the projection x can be deter¬ 
mined in time 0{p + |i7|). 


^ It follows from expanding the quadratic ^ 11x — w11| and the fact that 
||x ||2 = l.Vx e X{G). 

^ The longest path problem is commonly defined on graphs with 
weighted edges instead of vertices. The latter is trivially transformed to 
the former: set w{u,v) <— w{v), V(u,v) £ E, where w(u,v) denotes 
the weight of edge (u, v), and w(v) that of vertex v. Auxiliary edges can 
be introduced for source vertices. 
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4. Experiments 

4.1. Synthetic Data. 

We evaluate Alg. [T]and[^on synthetic data, generated ac¬ 
cording to the model of Sec. We consider two metrics; 
the loss function ||xx^ — x*x*||f and the Support Jaccard 
distance between the true signal x* and the estimate x. 
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For dimension p, we generate a (p, k, d)-layer graph G, 
with k = logp and out-degree d = p/k, i.e., each vertex is 
connected to all vertices in the following layer. We aug¬ 
ment the graph with auxiliary source and terminal vertices 
S and T with edges to the original vertices as in Fig. 

Per random realization, we hrst construct a signal x,^ € 
X{G) as follows: we randomly select an S-T path tt in G, 
and assign random zero-mean Gaussian values to the en¬ 
tries of X* indexed by tt. The signal is scaled to unit length. 
Given x,^, we generate n independent samples according to 
the spiked covariance model in 

Fig. s depicts the aforementioned distance metrics as a 
function of the number n of observations. Results are the 
average of 100 independent realizations. We repeat the pro¬ 
cedure for multiple values of the ambient dimension p. 
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Figure 2. Metrics on the estimate x produced by Alg.[^(Alg.|^is sim¬ 
ilar) as a function of the sample number (average of 100 realizations). 
Samples are generated according to the spiked covariance model with 
signal X* S iY{G) for a (p, k, d)-layer graph G. Here, k = logp and 
d = p/k. We repeat for multiple values of p. 


Comparison with Sparse PCA. We compare the perfor¬ 
mance of Alg. [T]and Alg. [^with their sparse PCA coun¬ 


terparts: the Truncated Power Method of (Yuan & Zhang 


20131 and the Spannogram Alg. of (Papailiopoulos et al. 


20131, respectively. 


Fig. [^depicts the metrics of interest as a function of the 
number of samples, for all four algorithms. Here, sam¬ 
ples are drawn i.i.d from Ai(0, S), where S has princi¬ 
pal eigenvector equal to x*, and power law spectral decay; 
Xi = Results are an average of 100 realizations. 


The side information on the structure of x* assists the re¬ 
covery; both algorithms achieve improved performance 
compared to their sparse PCA counterparts. Here, the 
power method based algorithms exhibit inferior perfor¬ 


Figure 3. Estimation error between true signal x* and estimate x from 
n samples, (average of 100 realizations). Samples generated i.i.d. 
~ N{0, S), where S has eigenvalues A; = and principal eigen¬ 

vector X* S X{G), for a (p, k, d)-layer graph G. (p = 10®, k = 50, 
d = 10). 


mance, which may be attributed to poor initialization. We 
note, though, that at least for the size of these experiments, 
the power method algorithms are signihcantly faster. 

4.2. Finance Data. 

This dataset contains daily closing prices for 425 stocks 
of the S&P 500 Index, over a period of 1259 days (5- 
years): 02.01.2010 - 01.28.2015, collected from Yahoo! 
Finance. Stocks are classihed, according to the Global 
Industry Classification Standard (GICS), into 10 business 
sectors e.g.. Energy, Health Care, Information Technology, 
etc (see Fig.j^for the complete list). 

We seek a set of stocks comprising a single representative 
from each GICS sector, which captures most of the vari¬ 
ance in the dataset. Equivalently, we want to compute a 
structured principal component constrained to have exactly 
10 nonzero entries; one for each GICS sector. 

Consider a layer graph G = (V, E) (similar to the one 
depicted in Eig. on p = 425 vertices corresponding to 
the 425 stocks, partitioned into A: = 10 groups (layers) 
C V, corresponding to the GICS sectors. 
Each vertex in layer Ci has outgoing edges towards all (and 
only the) vertices in layer Ci+i. Note that (unlike Eig. [T]) 
layers do not have equal sizes, and the vertex out-degree 
varies across layers. Einally, we introduce auxiliary ver¬ 
tices S and T connected with the original graph as in Eig.[2 

Observe that any set of sector-representatives corresponds 
to an S-T path in G, and vice versa. Hence, the desired set 
of stocks can be obtained by hnding a structured principal 
component constrained to be supported along an S-T path 
in G. Note that the order of layers in G is irrelevant. 

Eig.|^depicts the subset of stocks selected by the proposed 
structure PCA algorithms (Alg. BS. A single representa¬ 
tive is selected from each sector. Eor comparison, we also 
run two corresponding algorithms for sparse PCA, with 
sparsity parameter A; = 10, equal to the number of sectors. 
As expected, the latter yield components achieving higher 
values of explained variance, but the selected stocks origi- 
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nate from only 5 out of the 10 sectors. 
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Figure 4. The figure depicts the sets of 10 stocks extracted by sparse 
PCA and our structure PCA approach. Sparse PCA (k = 10), selects 10 
stocks from 5 GICS sectors (above). On the contrary, our structured PCA 
algorithms yield a set of 10 stocks containing a representative from each 
sector (below) as desired. 


4.3. Neuroscience Data. 


late cortex and the prefrontal cortex ( |Greicius et al.[[2009) l. 
The nucleus accumbens receives input from the hippocam¬ 
pus, and plays an important role in memory consolida¬ 
tion ( |Wittmann et al. 2005) 1. It is noteworthy that our ap¬ 
proach has pinpointed the core neural components of the 
memory network, given minimal information. 



We use a single-session/single-participant resting state 
functional magnetic resonance imaging (resting state 
fMRI) dataset. The participant was not instructed 
to perform any explicit cognitive task throughout the 
scan (Van Essen et al. 2013| l. Data was provided W the 
Human Connectome Project, WU-Minn Consortium]^ 

Mean timeseries of n = 1200 points forp = 111 regions- 
of-interest (ROIs) are extracted based on the Harvard- 
Oxford Atlas ( jPesikan et al.| [2006) 1. The timescale of anal¬ 
ysis is restricted to 0.01-0.IHz. Based on recent results 
on resting state fMRI neural networks, we set the posterior 
cingulate cortex as a source node S, and the prefrontal cor¬ 
tex as a target node T (Greicius et al. 2009| l. Starting from 
S, we construct a layered graph with fc = 4, based on the 
physical (Euclidean) distances between the center of mass 
of the ROIs; i.e., given layer Ci, we construct Ci+i from 
non-selected nodes that are close in the Euclidean sense. 
Here, |£i| = 34 and \Ci\ = 25 for i = 2,3,4. Each 
layer is fully connected with its previous one. No further 
assumptions are derived from neurobiology. 


The extracted component suggests a directed pathway from 
the posterior cingulate cortex {S) to the prefrontal cor¬ 
tex (T), through the hippocampus (1), nucleus accum¬ 
bens (2), parahippocampal gyrus (3), and frontal opercu¬ 
lum (4) (Eig. 1^. Hippocampus and the parahippocam¬ 
pal gyrus are critical in memory encoding, and have been 
found to be structurally connected to the posterior cingu- 

®(Principal Investigators: David Van Essen and Kamil Ugurbil; 
1U54MH091657) funded by the 16 NIH Institutes and Centers that sup¬ 
port the NIH Blueprint for Neuroscience Research; and by the McDonnell 
Center for Systems Neuroscience at Washington University. 


Figure 5. We highlight the nodes extracted for the neuroscience example. 
Source node set to the posterior cingulate cortex (S: PCC), and target to 
the prefrontal cortex (T: Prefrontal). The directed path proceeded from the 
nucleus accumbens (1: NAcc), hippocampus (2: Hipp), parahippocampal 
gyrus (3: Parahipp), and to the frontal operculum (4: Operculum). Here, 
X coordinates (in mm) denote how fai‘ from the midline the cuts are. 


5. Conclusions 

We introduced a new problem; sparse PCA where the set of 
feasible support sets is determined by a graph on the vari¬ 
ables. We focused on the special case where feasible spar¬ 
sity patterns coincide with paths on the underlying graph. 
We provided an upper bound on the statistical complexity 
of the constrained quadratic maximization estimator 0, 
under a simple graph model, complemented with a lower 
bound on the minimax error. Einally, we proposed two al¬ 
gorithms to extract a component accommodating the graph 
constraints and applied them on real data from finance and 
neuroscience. 

A potential future direction is to expand the set of graph- 
induced sparsity patterns (beyond paths) that can lead to 
interpretable solutions and are computationally tractable. 
We hope this work triggers future efforts to introduce and 
exploit such underlying structure in diverse research fields. 
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7. Proof of Lemma 1^- Local Packing Set 

Towards the proof of Lemma |2.2| we develop a modified 
version of the Varshamov-Gilbert Lemma adapted to our 
specific model; the set of characteristic vectors of the S-T 
paths of a (p, k, c?)-layer graph G. 

Let Sh (x, y) denote the Hamming distance between two 
points X, y G {0,1}*’: 

<5ir(x,y) = |{j : x, ^ j/JI. 

Lemma 7.4. Consider a {p, k, d)-layer graph G on p ver¬ 
tices and the collection V{G) of S-T paths in G. Let 

H = {x G {0,1}^ : supp{x) G V{G)}, 

i.e., the set of characteristic vectors of all S-T paths in G. 
For every ^ G (0,1), there exists a set, C H such that 


The intersection yB(a), r)nf2 corresponds to S-T paths in G 
that have at least k — additional vertices in common 
with tD besides vertices 1 and p that are common to all paths 
in LI: 


B{uj, r) n H 

= {w G H < r} 

= G LI : |{* : oji = uj^ = 1}| > fc — § “f 2} , 

where the last equality is due to ( [3T] i. In fact, due to the 
structure of G, the set r) n H corresponds to the S-T 
paths that meet u) in at least k — '’/2 intermediate layers. 
Taking into account that |rin(u)| = |rout('u)| = d, for all 
vertices vmV (G) (except those in the first and last layer), 


\B{i:d,r)nLl\< 



r 

• d2. 


<5ff(x,y) > 2(1 - • fc, \fx,y £Ll^,xfy, (29) 

and 

loglH^I >log£^ + (^fc-l)•logd-A•i^(e), (30) 
where is the binary entropy function. 


Now, consider a maximal set Ll^ C LI satisfying ( |29] l, i.e., a 
set that cannot be augmented by any other point in LI. The 
union of balls B{i.o, 2(1 — ^) ■ (k — 1)) over all tu G 
covers LI. To verify that, note that if there exists ut' G Lt\Ll^ 
such that > 2(1 — ^) ■ {k — 1), Vo; G Ll^, then 

satisfies ( |29l l contradicting the maximality of Ll^. 
Based on the above. 


Proof. Consider a labeling 1,... ,p of the p vertices in G, 
such that variable uJi is associated with vertex i. Each point 
u; G H is the characteristic vector of a set in 7^(G); nonzero 
entries of uj correspond to vertices along an S-T path in G. 
With a slight abuse of notation, we refer to u; as a path 
in G. Due to the structure of the (p, k, d)-layer graph G, all 
points in H have exactly fc + 2 nonzero entries, i.e., 

6h{(^, 0) = k-\-2, Vw G H. 

Each vertex in uj lies in a distinct layer of G. In turn, for 
any pair of points u),uj' G LI, 

SH{u^,ut') = 2-{k-\{i: w, =a;' = l}|-2). (31) 

Note that the Hamming distance between the two points is 
a linear function of the number of their common nonzero 
entries, while it can take only even values with a maximum 
value of 2fc. 


|H| < ^ |S(a;,2(l-0-fc)nH| 

< E ' 

< \Ll^\ ■ 

Taking into account that 

\Ll\ = \V{G)\ = ^-^-d^-\ 

we conclude that 

from which the desired result follows. □ 


Without loss of generality, let S and T corresponding to 
vertices 1 and p, respectively. Then, the above imply that 

uj\ = ojp = 1, Vut G LI. 

Consider a fixed point <2 G LI, and let B{Q,r) denote the 
Hamming ball of radius r centered at cD, i.e., 

B(fi,r) = {u;G{0, 1}^: (5//(tD,a;) < r}. 


Lemma 2.2 (Local Packing) Consider a (p, k, d)-layer 
graph G on p vertices with fc > 4 and log d> A ■ H (2/4). 
For any e G (0,1], there exists a set C X(G) such that 

e/y/2 < ||xi-Xj ||2 < V2 ■ e, 

for all Xi,Xj G <14, x^ f Xp and 

log|<T,| > log+ T • log 
k 4 
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Proof. Without loss of generality, consider a labeling 
1,... of the p vertices in G, such that S and T corre¬ 
spond to vertices 1 and p, respectively. Let 


n = {x G {0,1}^ : supp(x) G V{G)'\, 

where V{G) is the collection of S-T paths in G. By 
and for ^ = 3/4, there exists a set flj C fl such 


Lemma 

that 


7.4 


> - ■ k, 


(32) 


^ ’ 3.n(l, 

log|f^5l > log^ -f (I • fc - l) logd- A: -77(1) 
>^oS^ + l-k-\ogd-k- 77(1) 

> log ^ j • fc • log 7 (33) 


where the second and third inequalites hold under the as¬ 
sumptions of the lemma; A: > 4 and log 7 > 4 • 77 ( 3 / 4 ). 

Consider the bijective mapping ijj : —i' M.P defined as 





Vk 





We show that the set 


Xe = {^/’(w) : LJ G rif} . 

has the desired properties. First, to verify that X^ is a subset 
of X[G), note that Vu) G flj C 17, 

supp(V’(w)) = supp(u;) G V{G), (34) 


and 


(^ — p2'| ,2 P-1 

i=1 

Second, for all pairs of points x^, x^ G X^, 

g2 g2 

|jxi - yijWl = ■ — <2 ■ k ■ — = 2 ■ . 

The inequality follows from the fact that (u;, 0) = A: -f 2 
wi = 1 and ujp = 1, Vo; G 17^, and in turn 

<2- k. 

Similarly, for all pairs x^, Xj G X^, x^ 7 ^ Xj, 

1 £2 £2 

||x, -Xj ||2 = SH{uJ^,UJj) ' J - 2 ' ^ 

where the inequality is due to ( |32l ). Finally, the lower bound 
on the cardinality of X^ follows immediately from ( [3^ and 
the fact that \X^\ = jn^j, which completes the proof. □ 


8. Details in proof of Lemma 

We want to show that if 


£^ = min< 1, 


C" ■ (1 + P) log\^ + I ■ log 7 
/32 n 


for an appropriate choice of C > 0, then the following two 
conditions (Eq. 0) are satisfied: 


n 


2£2/32 1 1 

(l + /3)log|A’,| - 4 “ 


log|A’e| > 4log2. 


For the second inequality, recall that by Lemma 

log|A’e| > log-f i • fclog7 > 0. (35) 

k 4 

Under the assumptions of Thm.[T]on the parameters k and 7 
(note that p — 2 > A: • 7 by the structure of G), 

log|A’,| > log^^-^ -f j ■ log7 > 4 • 77 ( 3 / 4 ) > 41og2, 
k 4 

which is the desired result. 

For the first inequality, we consider two cases: 

• First, we consider the case where = 1, i.e.. 


G' • (1 +/3) log ^ -f I ■ log 7 
/32 n 


Equivalently, 


n 


2£2^2 

(1 + / 3 ) 


< 2 • C" • 



-f 



(36) 


• In the second case. 


C'-(l+/3) logg^ + ^log7 
/32 n ’ 


which implies that 


n 


2£2/32 

(1 + / 3 ) 


= 2 • C" • 



-f - • log 7 


(37) 


Combining ([36]l or ([JT]), with (|T5l), we obtain 


n 


2£2/32 1 

{1 + /3) log|A/| 


<2G'< 


1 

4 


for C’ < 1 / 8 . 

We conclude that for e chosen as in the conditions 
in ( [T3| ) hold. 
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9. Other 


and 


Assumption 1. There exist i.i.d. random vectors 
zi,..., z„ € MP, such that Ez^ = 0 and Ez^z^ = Ip, 


sup |jz7x||^2 <(39) 

xesr' 


y = M + 




(38) 


where fj, € MT and K > 0 is a constant depending on the 
distribution of ZiS. 




